ABSTRACT

A model of a generalized information storage and retrieval system is
proposed. The model consists of six subsystems (or blocks): logical
processor, selector, descriptor file, locator, document file, and
analysis block. These subsystems function in a partial environment
defined by the user and data blocks. Proceeding from a verbal
description, a functional representation of each subsystem is developed.
The functional representation describes not only what is done but also,
to some degree, how tasks are accomplished within each subsystem. An
immediate result of the functional representation is the definition
of a metalanguage for identifying some necessary characteristics of
higher level languages used in the implementation of information storage
and retrieval systems.




INTRODUCTION 



Lack of a recognized, well accepted theory of information retrieval has
provided a constant disturbance to some workers in this field. In the Fall, 1966
issue of the Forum (the newsletter of the Special Interest Group on Information
Retrieval), Lauren Doyle [5] refers to the "social turmoil" created by use of the
term "information retrieval". This upheaval stems, as he notes, from the inability
of people to accept a common definition of the term. With such disagreement on
the definition of "information retrieval", a more disparate perception of what
is encompassed by the field is a natural consequence. Recognition and acceptance
of a theory is believed by some [3] to offer some hope for reducing this diversity
of views. We admit our membership in this optimistic group, and our purpose is to
attempt a small step in the path toward establishment of some fundamental principles.
The fundamental description offered in this paper is not proposed as a theory;
rather, we seek to identify an approach by which a theory could evolve. Characteristic
of this approach are the dual objectives: (1) descriptiveness and (2) generality.

Descriptiveness is necessary if we are to evolve an accepted theory, i.e., one
"contributing to theory users" [3]. Generality, or the integration of several seemingly
distinct entities, characteristics, and/or methods into a single conceptual unit, is
a requirement of any theory, but we are determined to avoid the usual corequisite,
abstraction. Abstraction may prove necessary in subsequent stages of development,
but our present work is based on the practical objective of describing the functions
performed within an information retrieval system.



RELATED WORK 



Several authors have proposed theories of information retrieval or documentation,
and we survey only the recent attempts that include the perspective of
automatic information retrieval systems. A more comprehensive treatment, exploring
various subdisciplines and techniques of mathematics applied in the modeling
of information retrieval systems and subsystems, is given by Hayes [7]. His
purpose is to identify the role and contribution of mathematical models rather
than to develop a theory of information retrieval.

Most theories of information retrieval (IR) begin with a specific aspect of
the total problem. Jonker [9] offers a theory that deals primarily with the
classification or indexing aspect. His idea of the descriptive continuum is
that the existing indexing systems form a continuum based on the average length of
index terms. This continuum would have at one extreme the indexing systems using
single word terms; at the other extreme are the hierarchical classification
schemes in which the longest possible terms are used. Since the cost of an
IR system is largely dependent on the indexing task, Jonker [9, pp. 1311-1312]
argues that total system costs are reflected in the position of an indexing
system on the continuum. More recently, Soergel [12] proposes a formal system
representation of documentation systems in terms of the classification and query
search functions. Using primarily a set-theoretic approach, Soergel is able to
construct a classification of IR systems based on the relationships among descriptor
and query components, i.e., indexing terms in the former case and query terms in the
latter. Turski [15] proposes a model of an IR system focused on a formal
development of the thesaurus concept.

In his recent text Salton [11] summarizes three approaches to modeling IR
systems. From these models certain theoretical relationships can be derived.
One approach is based on the search function, i.e., the relationship between the
specified set of query terms and the resulting document set retrieved. An IR
system (I) is defined by the triple

    I = (D, R, T)

where

    D is the finite set of documents,
    R is the request language (finite set of request terms),
    T is a function mapping R into all possible subsets of D.

Given the requests r and s from a partially ordered set R, and if the ordering
(≤) is such that s ≤ r, then the retrieval function T: R → 2^D defines an
inclusive retrieval system if

    s ≤ r → T(r) ⊆ T(s).
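The inclusiveness condition can be checked mechanically on a small system. The following is a minimal sketch, assuming requests are sets of terms ordered by set inclusion; the document collection and the two retrieval functions are hypothetical illustrations, not taken from Salton's text.

```python
from itertools import product

def is_inclusive(requests, precedes, T):
    """Check that s <= r implies T(r) is a subset of T(s) for every pair."""
    return all(
        T(r) <= T(s)  # set inclusion: T(r) lies within T(s)
        for s, r in product(requests, repeat=2)
        if precedes(s, r)
    )

# Hypothetical data: each document carries a set of index terms.
DOCUMENTS = {"D1": {"a", "b"}, "D2": {"a"}, "D3": {"b", "c"}}

def conjunctive(request):
    """Retrieve the documents that carry every term of the request."""
    return {doc for doc, terms in DOCUMENTS.items() if request <= terms}

def disjunctive(request):
    """Retrieve the documents that carry at least one term of the request."""
    return {doc for doc, terms in DOCUMENTS.items() if request & terms}

def subset_order(s, r):
    """Assumed partial ordering on requests: s <= r when s is a subset of r."""
    return s <= r

REQUESTS = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]

print(is_inclusive(REQUESTS, subset_order, conjunctive))
print(is_inclusive(REQUESTS, subset_order, disjunctive))
```

Under this assumed ordering, adding terms to a conjunctive request can only narrow the retrieved set, so the conjunctive system satisfies the condition while the disjunctive one does not.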

A second approach is to model the IR system with respect to the classification
function. This approach, stemming from the earlier work of Mooers [10], defines
an IR system as

    I = (D, R, C, X, F)

where in addition to the document set (D) and request language (R)

    C is the classification language,
    X is the classification function, i.e., X: D → C, and
    F is a function mapping the request language (set) into
      all possible subsets of the classification language,
      i.e., F: R → 2^C.

The retrieval function T: R → 2^D is then defined in terms of the functions X
and F, i.e., given the request r the set of documents d returned is

    T(r) = {d | X(d) ∈ F(r)}.
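The composition of X and F into the retrieval function T can be written out directly. This is a minimal sketch under assumed data: the class labels, request terms, and the helper name make_retrieval are hypothetical and serve only to illustrate the definition of T(r).

```python
def make_retrieval(documents, X, F):
    """Build T(r) = {d | X(d) in F(r)} from a classification function X
    and a request-to-classes function F."""
    def T(r):
        allowed = F(r)
        return {d for d in documents if X(d) in allowed}
    return T

# Hypothetical classification of three documents (the function X).
CLASS_OF = {"D1": "physics", "D2": "chemistry", "D3": "physics"}

# Hypothetical mapping of request terms to sets of classes (the function F).
REQUEST_TO_CLASSES = {
    "atoms": {"physics", "chemistry"},
    "motion": {"physics"},
}

T = make_retrieval(
    list(CLASS_OF),
    CLASS_OF.get,
    lambda r: REQUEST_TO_CLASSES.get(r, set()),
)

print(sorted(T("motion")))
```

An unknown request maps to the empty set of classes in this sketch, so it retrieves nothing.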

Mooers uses the basic concepts above to classify IR systems: (1) descriptors
(association of a set of terms with each document), (2) characters with hierarchy
(a hierarchical classification scheme), and (3) characters with logic (characters
combined by logical operations). Mooers [10, p. 1332] defines a character as a
verbal symbol which (a) can be independently manipulated, (b) is primitive
(non-decomposable), (c) has definite meaning, and (d) is from a finite repertory.

A third approach discussed by Salton uses graph theory as the modeling
technique.
Other authors have chosen to avoid mathematical developments of an IR
theory and preferred to concentrate on formulating the fundamental problems;
for example, Swanson [13] gives a thought-provoking discussion of the several
subproblems (indexing, file organization, and performance requirements)
comprising the general IR problem.

THE FUNCTIONAL LANGUAGE APPROACH 

Concern with the languages of information retrieval has been demonstrated by
at least three authors. Dolby [4] reviews the population of programming languages
and discusses their relative capability for IR applications. He concentrates on
assembly languages, COBOL, PL/I, and several special purpose, primarily string
and list processing, languages. Vickery [18] relates the function of an IR
language to the indexing and search tasks, providing a description of functions
that may be performed in some particular systems. Fairthorne [6] proposes an
algebraic representation of IR languages that seeks to describe relationships
among terms within a system vocabulary.

Our approach is to propose a model of a generalized IR system.¹ The model
is composed of six subsystems with distinct functions. We use mathematical
operations to represent these subsystem functions. One result of this
functional representation is to define a metalanguage describing not only what
happens within the IR system but, to some degree, how it happens. In this respect
we differ from previous approaches but at the potential expense of sacrificing
generality. In this effort we have emphasized descriptiveness.

¹ We have shown the model to represent adequately four IR systems described
in the literature: QUERY, GIPSY, BIRS, and SMART [2].
FUNCTIONAL REPRESENTATION

One difficulty in developing a theory of information retrieval is the
lack of a well defined, completely comprehensive, existing system. In contrast
with the physical sciences, no entity is available for our examination.
Consequently, the comparison of the theory with the physical process, i.e., retrieving
information, is impossible. Thus, as Soergel [12, p. 170] notes, we must begin
with a preconceived model around which the theoretical framework can be structured.
We propose a generalized model of an IR system, identifying the six subsystems
and the environment in which the total system functions, i.e., the user and data
populations. Each subsystem is called a "block" (or module), and the user and
data populations also constitute "blocks". The blocks are examined independently,
and each subsystem is represented in terms of the language requirements for
implementing that block.

The Generalized Model 



Figure 1 shows the generalized model of an information storage and retrieval
system (IR system) proposed in the earlier work by Crouch [2]. The structural
similarity to models proposed by other authors, notably Vickery [16], is
acknowledged. In developing the representation of the IR system, we concentrate
on the functions executed by or within each subsystem (the rectangular blocks).
Together the user and data blocks serve to define the partial environment in which
the system operates. The total environment would include the funders or operators
of the system with considerations of policy and economies of operation. A brief
description of the relationships among the blocks follows.
The user (generally assumed to be unfamiliar with mechanized ISR systems or
digital computers) inputs a query to the system. The query is taken by the
logical processor, which operates on the query and outputs to the selector the
query in terms of descriptors or index terms. The selector uses the descriptors
to search the descriptor file (or index). The resultant specifications, i.e.,
pointers to those documents which have successfully satisfied the search according
to some pre-established criteria, are returned to the selector. The selector,
which may or may not operate on these specifications, sends the final selected
specifications to the locator, which uses this information to search the
document file. The documents themselves are returned first to the locator and
from there to the user.

The second part of the environment definition is the data. Data enters the
system at the analysis block. The analysis block operates on the input to produce
two outputs: a representation of the document in terms of descriptors, to be stored
in the descriptor file along with a pointer to the document in the document file,
and a reference to the document itself (i.e., an identifier) to be stored in the
document file.

Note the three feedback loops involving the user:

(1) from the user to the logical processor and back to the user,

(2) from the user to the logical processor and selector, then back
    to the user, and

(3) from the user to the logical processor, selector, and locator,
    then back to the user.

In the first case, the logical processor is asking the user to re-formulate,
clarify, or augment his query. In the second case, the selector is requesting
user approval of the selected specifications, i.e., for the user to designate
from amongst the set those that most accurately describe his needs.

We assume that all information stored within the system enters through the
analysis block; thus any information concerning the user, his use of the system,
or resulting from this use must be viewed as input to the analysis block.
Description of the Environment

The environment is described by the user and data blocks. Economic aspects
of system operation are ignored, so that we actually describe a partial
environment. Our assumptions about this environment are limited. We consider that the
user is motivated by a need for information and interacts with the IR system in
his attempt to satisfy this need. Although perhaps quite unknowledgeable of the
system structure and/or capability, he is nevertheless able to supply the initial
character string in interaction with the system. We designate the input query Y
to be the set of all strings initially used to describe the user need for
information,²

    Y ::= {y}

The second part of the environment, the data block, comprises the raw material
input to the IR system. We assume this input to be unprocessed textual material
in the recorded form convenient to the system. Although certain conventions may
be followed in compiling this material for input, no manipulation by trained
personnel prior to entry is assumed. No doubt the form of this raw material can
influence the system's processing effectiveness (reducing the requirements for
automatic content analysis [13, 7] for example), but for our purposes this material
is considered as a set of recorded symbols recognizable by the analysis block.

This set of recognizable recorded symbols is called a document (D),

    D ::= {[a_i] | [a_i] ∈ A}

where each document is composed of a finite number of symbols (characters)
[a_i], i.e., single character strings which are members of the finite symbol set A.

We impose few requirements on the user and data blocks, consequently forcing
the IR system to accept an increased responsibility at two points: the logical
processor and analysis blocks. In fact, we see nothing at this time to prevent
the IR system's serving either a fact retrieval or document retrieval purpose;
however, the representation of the blocks corresponding to the six IR subsystems
(the logical processor, selector, descriptor file, locator, document file, and
analysis block) is oriented toward document retrieval.

² All symbols and notation used, except the operators in Table 1, are defined in
the Appendix in addition to their definition in the body of the paper.

A Language Approach to Functional Representation 

The symbols used in specifying the functional representation are defined in
the Appendix. Wherever possible we have attempted to follow "conventions"
employed in programming language definition, or the "usual" mathematical notation.
Unfortunately, no single set of symbols and no standard terminology are
universally accepted; hence, we apologize a priori to the reader for our failure to
adhere to his individual preference.

The operators used in the functional representation are defined in Table 1.
Basic definitions used in the development of the representation are given in Table
2. Necessary additional notation is introduced within the development of each
block.
Operator   Description/Definition            Use
--------   ----------------------            ---
+          comparison                        compares element on left side of operator
                                             to every element of the set on the right
                                             side of the operator

( )        parentheses                       alters usual left-to-right execution of
                                             Boolean expression by giving higher
                                             priority to operations to be performed
                                             within innermost nested parentheses

©          relational (=, ≠, >, ≥, <, ≤)     © denotes any member of the set of
                                             relational operators

o          logical (∧, ∨)                    o denotes any member of the set of logical
                                             operators. Both and (∧) and or (∨) have
                                             the same priority, modified only by the
                                             presence of parentheses

oX         oX ::= [x_1 o x_2 o ... o x_v(X)] the application of the operator o to the
                                             set X to form a string (where square
                                             brackets denote that the contents of the
                                             brackets is considered a string, and v(X)
                                             denotes the number of elements in the
                                             set X)

|          [a_i] | [a_(i+1)] ::= [a_i a_(i+1)]
                                             string concatenation

Table 1. Operators Used in the Functional Representation
Notation                             Description
--------                             -----------
D                                    a document
d                                    a descriptor
𝒟 ::= {D}                            set of all documents
Δ ::= {d}                            set of all descriptors
R(x)                                 a. contents of record R corresponding
                                        to record identifier x, or
                                     b. a set of items associated with
                                        identifier x, or
                                     c. a mapping which associates with
                                        x a set of items R(x)
Q(D)                                 set of descriptors associated with
                                     (describing) document D
T(d)                                 set of documents (document identifiers)
                                     associated with (described by) descriptor d
v(x)                                 value associated with x
G ::= {g}                            set of all grammatical constructions
                                     (punctuation symbols, non-meaningful strings)
U ::= {u}                            set of all query terms
A ::= {[a_i], i = 1, 2, ..., v(A)}   the symbol set recognizable by the system

Table 2. Basic Definitions in the Functional Representation
The Logical Processor

We assume that the query is expressed in a restricted natural language;
if desirable the degree of restriction could be minor. The primary task of the
logical processor is to accept the query as input and to produce a reduced query,
the query expressed in the system's vocabulary, as output to the selector.
Production of the reduced query can be subdivided into the following tasks:

(1) query recognition - identifying the input string Y as a
    legitimate query and possibly performing a syntactic
    analysis of the input either independently or in dialogue
    with the user to enable modification according to system
    requirements;

(2) query reduction - removing all grammatical constructions
    and nonsubstantive words unrelated to the supposed
    "information content" of the query;

(3) normalization - expanding the query by dictionary reference,
    in the process of translating the input strings into terms
    consistent with the system vocabulary; and

(4) pre-search activities - using the formulation of the query
    resulting from the three previous tasks, to allow user
    feedback in further query modification.

We can represent the function of the logical processor by beginning with
the query input string Y, i.e.,

    Y ::= {y}

where the set of all substrings y comprise the query Y. An essential assumption
is the left-to-right ordered scan of all character strings, those produced as
well as those supplied. Thus all strings are examined in a left-to-right order
unless precedence operators, e.g., the parentheses characters, are present to
alter this order. We assume the permitted operator set to be composed of the
precedence operators "(" and ")" and the logical operators

    o ::= {∧, ∨}

(with the negation operator omitted).
Let

    G ::= {g}, the set of all grammatical constructions (punctuation)
               and non-substantive terms, and

    U ::= {u}, the universal set of all query terms;

then the set of all possible (Boolean) queries is defined recursively as the
set of strings

    B ::= U | (B) | B o B

During the query recognition phase, denoted by the subscript R,³ the logical
processor (ℒ) acts either to reject the query (if it is not syntactically
recognizable) or to augment it (in the case of incomplete syntax)

    ℒ_R: ⟨Y ∪ {y} | φ⟩ → Y',

where {y} may be null and the rejection, indicated by φ, obviously prompts some error
message.

The logical processor begins the query reduction task with a string Y' (possibly
different from that submitted to the query recognition phase) where

    Y' = {U' ∪ G'} and U' ⊆ U, G' ⊆ G.

In the query reduction phase the logical processor is applied to construct Y''
(the reduced query), a member of the Boolean query set implicit in Y',

    ℒ_P: P(Y') → Y''

with all g ∈ G stripped away. The query reduction function P acts on the ordered
sequence as follows

    P(Y') = {P(y_1), P(y_2), ..., P(y_v(Y'))}

where v(Y') indicates the number of elements in Y'. Note that the order of Y' is
preserved in Y'', i.e., for U' ::= {u_i}, the scan of Y' causes u_1 to be the first y
such that y = u, u ∈ U, etc.

³ The use of a subscript on a function symbol, e.g., ℒ_R, serves only to identify a
particular task of a more comprehensive function, in this case the logical processor.
No relationship is intended between the function and the entities to which it is
applied.
For example, the input to the query reduction phase is scanned so as to
form strings with each string [y] having one of the properties:

(1) consisting entirely of meaningful characters,

(2) consisting entirely of non-meaningful characters
    (punctuation, blanks, articles, etc.), or

(3) consisting entirely of one or more reserved characters
    (parentheses or logical characters).

The query reduction operation causes the following value assignments

             u          if y = u, u ∈ U
             (          if y = "("
             )          if y = ")"
    P(y) =   ∧          if y = "and"
             ∨          if y = "or"
             Λ          if y = g, g ∈ G
             M(E) ∪ y   otherwise

where M(E) is an error message activated by the attempt to process y.
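The value assignments for the query reduction operation can be illustrated with a short scanner. This is a minimal sketch, not the authors' implementation: the stop-word list standing in for G, the sample vocabulary standing in for U, the token pattern, and the use of the strings "AND" and "OR" for the logical operators are all assumptions made for the example. Punctuation, blanks, and digits never match the token pattern and are therefore dropped, as members of G would be.

```python
import re

# Assumed stand-in for G: non-substantive words to strip away.
NON_SUBSTANTIVE = {"the", "a", "an", "of", "on", "in", "about"}

# Reserved strings and the symbols they reduce to.
RESERVED = {"(": "(", ")": ")", "and": "AND", "or": "OR"}

def reduce_query(text, vocabulary):
    """Scan text left to right and return the reduced query as a list."""
    reduced = []
    for y in re.findall(r"[()]|[a-z]+", text.lower()):
        if y in RESERVED:
            reduced.append(RESERVED[y])   # parentheses and logical operators
        elif y in NON_SUBSTANTIVE:
            continue                      # y = g: stripped away
        elif y in vocabulary:
            reduced.append(y)             # y = u: a meaningful query term
        else:
            # Corresponds to the error message M(E) in the text.
            raise ValueError(f"unrecognized string in query: {y!r}")
    return reduced

# Hypothetical set U of query terms.
VOCABULARY = {"retrieval", "indexing", "classification"}

print(reduce_query("the retrieval and (indexing or classification)", VOCABULARY))
```

The order of the surviving terms follows the left-to-right scan, in line with the assumption stated for all strings handled by the logical processor.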

In the normalization or expansion phase, each meaningful term u is used to
identify the subset of all descriptors associated with u, N(u),

    ℒ_N: Y'' → p[∨N(e)] = Y'''

where

    N(e) =   N(u)   if e = u
             e      otherwise

The decomposition function p[oX] breaks the string of terms and logical
operators into separate elements

    p[oX] ::= p[x_1 o x_2 o ... o x_v(X)]
            = {x_1, o, x_2, o, ..., o, x_v(X)}

and forms the set Y''' by its operation on Y''. Members of Y''' are parentheses, the
logical operators (∧, ∨) and descriptors (d) from the set of all descriptors
(Δ ::= {d}).
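The normalization step can be sketched as a substitution over the reduced query. This is an illustrative sketch only: the thesaurus contents and the function name normalize are hypothetical, the strings "AND" and "OR" stand for the logical operators, and the descriptors of N(u) are sorted merely to make the output deterministic. Following the formula as written, no parentheses are added around an expansion.

```python
def normalize(reduced_query, thesaurus):
    """Replace each query term u by the descriptors N(u), joined by OR and
    decomposed into separate elements; other elements pass through."""
    expanded = []
    for e in reduced_query:
        if e in thesaurus:                     # e is a term u with an entry N(u)
            for i, d in enumerate(sorted(thesaurus[e])):
                if i > 0:
                    expanded.append("OR")
                expanded.append(d)
        else:                                  # parentheses, operators, other terms
            expanded.append(e)
    return expanded

# Hypothetical dictionary mapping query terms to system descriptors.
THESAURUS = {
    "retrieval": {"information retrieval", "document retrieval"},
    "indexing": {"indexing"},
}

print(normalize(["retrieval", "AND", "indexing"], THESAURUS))
```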
The function of the logical processor in presearch activities would involve
repetition of these three phases. One can visualize the function of the logical
processor to be defined in terms of the individual task functions.
The Selector

The selector, using the processed form of the query (Y''') as input, retrieves
from the descriptor file the set of all documents associated with each descriptor
(d ∈ Y'''). The indicated logical operations are then performed in the order
specified (by the use of parentheses). The result is a set of selected
specifications, i.e., the set of all document identifiers associated with the initial
query. In addition, selected specifications, e.g., a document listing, the number
of documents associated with each d, etc., may be returned to the user. This is
often termed post-search activity.

In representing the function of the selector, we must consider the
relationship between this block and the descriptor file. We represent the descriptor file
essentially as a passive block acted upon by the selector (and the analysis block).
Let Ω be the function which evaluates any valid set expression. The function of
the selector (S_d) with respect to the descriptor file is represented as

    S_d: Y''' → Ω((d → p["(" ∨T(d) ")"], ∀ d ∈ Y'''), ∨ → ∪, ∧ → ∩) = D' ⊆ 𝒟

where 𝒟 is the set of all documents.

While appearing complex, the representation of the selector is quite
straightforward. Consider the simple example
    Y''' = {d_1, ∨, (, d_2, ∧, d_3, )}

with the following document references

    d_1 refers to D_3, D_5, D_6
    d_2 refers to D_4, D_5
    d_3 refers to D_4, D_7.

The actions of the selector are:

(1) identify the document set T(d) associated with d, e.g.,
    for d_1, T(d_1) = {D_3, D_5, D_6};

(2) "or" the members of the document set (in a sense this
    creates a string) and enclose them in parentheses, e.g.,
    for d_1, (D_3 ∨ D_5 ∨ D_6);

(3) apply the decomposition function to T(d) in its string
    form and replace d by T(d), e.g., for d_1, (D_3, ∨, D_5, ∨, D_6);

(4) this is repeated for all d ∈ Y''' to give the result, e.g.,

    {(D_3, ∨, D_5, ∨, D_6), ∨, (, (D_4, ∨, D_5), ∧, (D_4, ∨, D_7), )};

(5) all logical operators are replaced by the union and intersection
    operators and the result is evaluated by applying Ω

    ((D_3 ∪ D_5 ∪ D_6) ∪ ((D_4 ∪ D_5) ∩ (D_4 ∪ D_7)))

    producing the set of documents, e.g.,

    {D_3, D_4, D_5, D_6}.

The second function (sometimes called post-search activity) of the selector
is to operate on D' in some manner so as to return some aspect of D' to the
user. We use the notation v(D') to indicate some "value" associated with D' as
the output,

    S_u: D' → v(D')

The nature of v(D') is system dependent.
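The worked selector example can be reproduced with a small evaluator. This is a minimal sketch under stated assumptions: the query is taken to be well formed, the strings "AND" and "OR" stand for the logical operators, and the names select and T_EXAMPLE are hypothetical. The two operators are given equal priority and applied left to right, with parentheses overriding that order, as the paper assumes.

```python
def select(query, T):
    """Evaluate a decomposed Boolean query over document sets.

    query is a list of descriptors, "AND", "OR", "(" and ")";
    T maps each descriptor d to its document set T(d).
    """
    pos = 0

    def expression():
        nonlocal pos
        result = operand()
        while pos < len(query) and query[pos] in ("AND", "OR"):
            op = query[pos]
            pos += 1
            right = operand()
            # OR becomes set union, AND becomes set intersection.
            result = result | right if op == "OR" else result & right
        return result

    def operand():
        nonlocal pos
        token = query[pos]
        pos += 1
        if token == "(":
            value = expression()
            pos += 1                      # skip the matching ")"
            return value
        return set(T[token])              # replace d by its document set T(d)

    return expression()

# Document references from the example in the text.
T_EXAMPLE = {
    "d1": {"D3", "D5", "D6"},
    "d2": {"D4", "D5"},
    "d3": {"D4", "D7"},
}

print(sorted(select(["d1", "OR", "(", "d2", "AND", "d3", ")"], T_EXAMPLE)))
```

Removing the parentheses changes the result, because the union is then formed before the intersection is applied.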
Representing both functions of the selector (S) requires the execution of the
document selection followed by post-search activity, i.e.,

    S ::= {S_u{S_d}}.



The Descriptor File 

Representation of the descriptor file begins with two assumptions:

(1) The hardware capability of the system is similar
    to that of many existing systems; it includes
    (besides a large mass memory and multiple tape
    units) a number of auxiliary storage devices such
    as disc, drum, and/or data cell. With the possible
    exception of an interactive capability through
    teletypewriter, it includes no specialized hardware
    devices.

(2) The main concern in our generalized retrieval
    system is single query processing (i.e., the query
    of the individual user), rather than the batch
    processing of multiple queries.

The descriptor file is viewed as passive, as we note above. We can characterize
it by representing its organization rather than prescribing any active functions
performed by it. A similar approach is taken by Hsiao and Harary [8] in representing
the search functions (selector) as they relate to various file (descriptor file)
organizations.

We consider the system vocabulary to be changing (probably increasing) and
determined by the criteria invoked in the analysis block (no static thesaurus
is assumed). Furthermore, we assume no weightings are applied to descriptors.

The essential task is to represent the process by which the set T(d) is defined
for the three principal file organization techniques: serial, inverted, and
multilist.
1. The Serial File

A typical serial file entry is seen as follows.

    [Diagram: serial file entry headed by the document tag v(D)]

Associated with each document D is a set R(D) of terms
or descriptors, d_i. The serial file may then be characterized
by:

    (1) Q(D) ::= R(D)

    (2) T(d) ::= {d +^v(D) R(D), ∀ D ∈ 𝒟}

Thus T(d), the set of all documents associated with
descriptor d, is found by the following process. First
d is compared to every element of the set R(D). If d
is an element of R(D), v(D) (the associated document tag)
is returned. The comparison is made for all D contained
in the set 𝒟.
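The serial file process described above amounts to a scan of every document record. The following is a minimal sketch with hypothetical records; the names SERIAL_FILE, Q_serial, and T_serial are illustrative and do not appear in the paper.

```python
# Hypothetical serial file: one record per document, holding the
# document tag v(D) and the descriptor set R(D).
SERIAL_FILE = [
    ("D1", {"indexing", "thesaurus"}),
    ("D2", {"retrieval"}),
    ("D3", {"indexing", "retrieval"}),
]

def Q_serial(tag, serial_file):
    """Q(D) = R(D): the descriptors stored in the record of document D."""
    for doc_tag, descriptors in serial_file:
        if doc_tag == tag:
            return set(descriptors)
    return set()

def T_serial(d, serial_file):
    """T(d): compare d with R(D) for every D; return v(D) on each match."""
    return {doc_tag for doc_tag, descriptors in serial_file if d in descriptors}

print(sorted(T_serial("indexing", SERIAL_FILE)))
```

In this organization Q(D) reads a single record, while T(d) must visit every record in the file.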

2. The Inverted File

A typical inverted file entry is reproduced below.

    v(d_i) : v(D_j) : ... : v(D_k) : ... : v(D_m)

That is, associated with every descriptor d is a set R(d)
of documents D_i. Inverted file organization can be represented by:

    (1) T(d) ::= R(d)

    (2) Q(D) ::= {v(D) +^d R(d), ∀ d ∈ Δ}
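For the inverted file the roles of the two functions are reversed relative to the serial file. The sketch below uses a hypothetical dictionary as the inverted file; the names INVERTED_FILE, T_inverted, and Q_inverted are illustrative only.

```python
# Hypothetical inverted file: one record per descriptor, holding the
# set R(d) of document tags.
INVERTED_FILE = {
    "indexing": {"D1", "D3"},
    "retrieval": {"D2", "D3"},
    "thesaurus": {"D1"},
}

def T_inverted(d, inverted_file):
    """T(d) = R(d): a direct lookup of the record for descriptor d."""
    return set(inverted_file.get(d, ()))

def Q_inverted(tag, inverted_file):
    """Q(D): compare v(D) with R(d) for every d; return d on each match."""
    return {d for d, tags in inverted_file.items() if tag in tags}

print(sorted(T_inverted("retrieval", INVERTED_FILE)))
print(sorted(Q_inverted("D3", INVERTED_FILE)))
```

Here T(d), the object of every selector search, is a single lookup, and it is Q(D) that requires a pass over all descriptors.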
3. The Multilist File

Multilist file organization is somewhat more complex
than the others, since it involves the use of an additional
file, frequently called the Directory. Multilist file
organization is pictured below.

    [Diagram: Directory entry for descriptor d_j, with its pointer v(D_k),
    and the corresponding Main File entries]

All main file entries are of the form

    [Diagram: form of a main file entry]

where v(d_j) = v(D_h). The diagram shows the directory entry associated
with some specified descriptor d_j and the corresponding main
file entries.

Thus multilist file organization can be characterized by:

    (1) T(d) ::= {d +^v(d) R(R(d)) / v(d) = Λ} ∪ R(d)

    (2) Q(D) ::= {v(D) +^d R(d), ∀ d ∈ Δ}

In each search of the descriptor file by the selector, the object of the search
is the set T(d).
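The chained search implied by the multilist organization can be sketched as a linked-list traversal. This is a sketch under assumptions: the directory is modeled as a mapping from each descriptor to the tag of the first document on its list, each main file record maps a descriptor to the tag of the next document on that list, and None plays the role of the end-of-list value. The data and names are hypothetical, and the chains are assumed to be free of cycles.

```python
# Hypothetical directory: descriptor -> tag of the first document on its list.
DIRECTORY = {"indexing": "D1", "retrieval": "D1"}

# Hypothetical main file: document tag -> {descriptor: tag of next document}.
MAIN_FILE = {
    "D1": {"indexing": "D2", "retrieval": "D3"},
    "D2": {"indexing": None},
    "D3": {"retrieval": None},
}

def T_multilist(d, directory, main_file):
    """Follow the list for descriptor d from the directory through the
    main file until the end-of-list value is reached."""
    found = []
    tag = directory.get(d)
    while tag is not None:
        found.append(tag)
        tag = main_file[tag][d]       # link to the next record on the list
    return found

print(T_multilist("indexing", DIRECTORY, MAIN_FILE))
```

Each document on a descriptor's list is reached only through its predecessor, so the cost of building T(d) grows with the length of the list rather than with the size of the file.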
The Locator

Just as the selector searches the descriptor file in order to extract the
documents or document tags associated with each descriptor in the expanded query,
the locator (ℛ) searches the document file to extract the record associated with
each document in the set (D') passed to it by the selector. This record may
consist of the document title, an abstract, or an extract. In any case, the contents
of the document file entry associated with the document are returned to the user
under the heading of "located documents." We represent the function of the locator
as simply:

    ℛ: (R(D), ∀ D ∈ D') → USER

where R(D) is the entire data record (entry) associated with document D.



The Document File

The document file is composed of entries R(D) which are the IR system's
representation of the corresponding documents. Formed by the analysis block, the system's
representation of each document is determined by the criteria applied there. We
assume that in every case a unique document identifier v(D) is an entry in the
document record R(D).

Vickery [17] states that document representation may be formed in three
ways: by simple extraction, by selective extraction, and by the assignment
of certain keys (e.g., standard descriptors). The analysis block may leave the
document (data) input virtually intact, operating only to construct the document's
representation in the descriptor file. Consequently, the entire, unaltered
document may serve as its representation.
The document file, like the descriptor file, is considered a passive block.
In this case the locator is the active block operating on the document file.
Similarly, the representation involves defining the file organization, which is
assumed to be simply an ordering by the document identifier v(D) or an ordering based on
frequency of use. In either case the document file is organized according to some
attribute (or combination of attributes) of the record R(D) corresponding to the
document D.

    R(D_1)   R(D_2)   ...   R(D_j)

Thus file organization is represented simply as R(D).



The Analysis Block

The analysis block constitutes the second entry point for input external
to the IR system (the other being the logical processor). The function of the
analysis block is to process the incoming data in order to produce two outputs:

(1) some indication of the contents of the
    incoming document, to be stored in the
    descriptor file along with a pointer to
    the document in the document file, and

(2) a representation of the document itself
    (i.e., the system's representation of the document),
    to be stored in the document file.

Obtaining the description of the document's contents is commonly called the
indexing task.
The importance of the indexing task has been noted by several authors
[16, p. 22], [1, p. 317]. Automatic indexing techniques fall into four general
categories:

(1) permutation indexing,

(2) citation indexing,

(3) statistical procedures, or

(4) syntactic procedures.

While application of the techniques in each category requires quite different
assumptions and utilizes different aspects of the data, they all operate on the
data with the same objective: to construct a set of descriptors that "... somehow
indicate (emphasis given originally) the information content of the document ..."
[1, p. 317].

The second major task of the analysis block is the construction and storage 
of a document representation in the document file. This representation would 
include a document identifier, all the elements of a bibliographic reference 

(author, title, publisher, etc.), and might include references, an abstract, and/or 
the complete document text. 



We should also note the possible use of clustering techniques within the 
analysis block. In information retrieval, the object of clustering algorithms is 
to generate groups of associated terms (for use in a thesaurus) or to form 



document clusters facilitating the matching of the analyzed search request with the 
document identifiers. The result is to simplify the retrieval process. 

We view the task of document representation as requiring some of the functions 
employed in the indexing task. Usually the indexing task is much more complex, 
while the document representation may be almost perfunctory. Considering the 
indexing function of the analysis block, Vickery [16, pp. 21-22] recognizes three 
stages in the assignment of document descriptors: 

(1) a scan of the text to derive those words, 
phrases, and/or sentences which best represent 
information content, 

(2) a decision as to which of the descriptors 
are worthy of being recorded in the descriptor 
file, in view of the purpose of the system, 
and 

(3) the transformation of the selected descriptors 
into a standard "descriptor language," the 
resulting terms of which serve as the entry or 
entries in the descriptor file. 

We describe these three stages by two functions, i.e., the string formation function 
($\mathcal{A}_F$) and the descriptor determination function ($\mathcal{A}_d$). 

Recall that a document (D) is defined as a set of strings, i.e., 

$$D ::= \{[\alpha_i] \mid [\alpha_i] \in A\}$$

In its raw, unprocessed form, the data entering the analysis block are members of 
the finite symbol set $A ::= \{[\alpha_i],\ i = 1, 2, \ldots, v(A)\}$, which can be considered single 
character strings. This set can be partitioned into two subsets 

$$C_T \subset A \quad \text{and} \quad C_R \subset A$$

where $C_T$ is the set of terminator symbols and $C_R$ the set of non-terminators and 

$$C_T \cap C_R = \phi.$$

Thus we represent the data (a single unprocessed document) as a set 

$$D ::= \{[\alpha_i] \mid [\alpha_i] \in A\}.$$

The scan of D is assumed to be from left to right. 

The string formation function $\mathcal{A}_F$ operates to form the set of descriptor 
candidates $\Delta$ by concatenating the symbols $[\alpha_i]$ to produce terms. 

$$\mathcal{A}_F : ([\alpha_i] \ni [\alpha_i] \notin C_T \to [\alpha],\ \forall [\alpha_i] \in D) = \Delta.$$

As a consequence of this operation the set $\Delta$ can be defined as a set of strings 

$$\Delta ::= \{[\alpha] \mid [\alpha_i] \in C_R\ \ \forall [\alpha_i] \in [\alpha]\}.$$

The descriptor determination function $\mathcal{A}_d$ operates on the set $\Delta$ to select those 
descriptors to be inserted in the descriptor file. 

$$\mathcal{A}_d : \Delta = \{d \mid d \in Q(D)\}$$

where Q(D) is the set of all descriptors associated with document D. In this 
manner no limitations are placed on the size of the descriptor vocabulary ($\mathcal{S}$), 
where $\mathcal{S}$ is the set of all descriptors, 

$$\mathcal{S} ::= \{d\}.$$



Again, we use an example to illustrate our representation. Let 

$$C_T = \{\text{blank},\ \text{","},\ \text{"."},\ \ldots\}$$

$$C_R = \{a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z\}$$

then $A = C_T \cup C_R$. A sample from the input string follows: 

THIS BOOK DESCRIBES THE USE OF THE DIGITAL COMPUTER IN THE WORLD OF INDUSTRY, 
COMMERCE, AND BANKING. ... 

Then, the first five symbols are 

$$[\alpha_1] = \text{"T"},\quad [\alpha_2] = \text{"H"},\quad [\alpha_3] = \text{"I"},\quad [\alpha_4] = \text{"S"},\quad [\alpha_5] = \text{" "}\ \text{(the blank character)}$$

The operation of $\mathcal{A}_F$ results in the set 

$$\Delta = \{[\alpha]_1, [\alpha]_2, \ldots, [\alpha]_{17}\}$$

where $[\alpha]_1$ = "THIS", $[\alpha]_2$ = "BOOK", ..., $[\alpha]_{17}$ = "BANKING". 
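The string formation step can be sketched in a few lines. This is only an illustration under stated assumptions: the terminator set is taken to be the blank, the comma, and the period (enough to reproduce the sample), the name `form_strings` is hypothetical, and a list is used instead of a set so that the left-to-right order of the candidates is preserved.

```python
# Assumed terminator set C_T: blank, comma, and period only.
TERMINATORS = set(" ,.")


def form_strings(symbols, terminators=TERMINATORS):
    """Scan the symbols left to right and concatenate runs of
    non-terminator symbols into candidate strings."""
    candidates = []
    current = []
    for symbol in symbols:
        if symbol in terminators:
            # A terminator closes the string being built, if any.
            if current:
                candidates.append("".join(current))
                current = []
        else:
            current.append(symbol)
    if current:  # flush a trailing string not followed by a terminator
        candidates.append("".join(current))
    return candidates


SAMPLE = ("THIS BOOK DESCRIBES THE USE OF THE DIGITAL COMPUTER IN THE WORLD "
          "OF INDUSTRY, COMMERCE, AND BANKING.")
candidates = form_strings(SAMPLE)
```

Applied to the sample sentence, the sketch yields seventeen candidate strings, beginning with "THIS" and ending with "BANKING", in agreement with the example.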

The criteria used in the selection of descriptors are system dependent. 
Application of $\mathcal{A}_d$ to the set $\Delta$ is equivalent to applying the function to each 
member of the set, i.e., for our example 

$$\mathcal{A}_d : \Delta = \{\mathcal{A}_d : [\alpha]_1,\ \mathcal{A}_d : [\alpha]_2,\ \ldots,\ \mathcal{A}_d : [\alpha]_{17}\}$$

where $\mathcal{A}_d : [\alpha] \to d$ according to the criteria applied; otherwise 
$\mathcal{A}_d : [\alpha] \to \Lambda$. If a fixed descriptor vocabulary is used, the descriptor 
qualification $\mathcal{A}_d : [\alpha] \to d$ is easily determined. In a system where a fixed 
vocabulary is not used, $v(\alpha)$ might be used to determine the result. 
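For the fixed-vocabulary case, descriptor determination reduces to a table lookup applied to each candidate string. The sketch below assumes a hypothetical two-entry vocabulary and uses `None` to stand for the null result; the function names are illustrative and not part of the paper's notation.

```python
NULL = None  # stands in for the null result of the determination function

# Hypothetical fixed descriptor vocabulary: candidate string -> descriptor.
VOCABULARY = {"DIGITAL": "d1", "COMPUTER": "d2"}


def determine_descriptor(candidate, vocabulary=VOCABULARY):
    """Map one candidate string to a descriptor, or to NULL if it
    does not qualify under the fixed vocabulary."""
    return vocabulary.get(candidate, NULL)


def determine_descriptors(candidates, vocabulary=VOCABULARY):
    """Apply the determination to every candidate and keep the
    non-null results as the descriptor set of the document."""
    results = [determine_descriptor(c, vocabulary) for c in candidates]
    return {d for d in results if d is not NULL}


sample_candidates = ["THIS", "BOOK", "DIGITAL", "COMPUTER", "IN"]
descriptor_set = determine_descriptors(sample_candidates)
```

A system without a fixed vocabulary would replace the dictionary lookup with whatever qualification criterion it applies to each string; the element-by-element structure stays the same.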



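Thus the function $\mathcal{A}_d$ can produce a different set of descriptors Q(D) 
depending on the criteria which are applied. For our example, let us assume 
that $\mathcal{A}_d$ operates on the set of strings $\Delta$ associated with document D as 
follows: 

$$\begin{aligned}
\mathcal{A}_d : [\alpha]_1 &= \mathcal{A}_d : \text{"THIS"} \to \Lambda \\
\mathcal{A}_d : [\alpha]_2 &= \mathcal{A}_d : \text{"BOOK"} \to \Lambda \\
\mathcal{A}_d : [\alpha]_8 &= \mathcal{A}_d : \text{"DIGITAL"} \to d_1 \\
\mathcal{A}_d : [\alpha]_9 &= \mathcal{A}_d : \text{"COMPUTER"} \to d_2 \\
\mathcal{A}_d : [\alpha]_{10} &= \mathcal{A}_d : \text{"IN"} \to \Lambda
\end{aligned}$$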



Two additional functions remain to be accomplished in the analysis block, 
and these relate to the file maintenance requirements. For the descriptor file 
the tasks required differ according to the file organization employed. We 
denote the maintenance function required for the descriptor file by $\mathcal{A}_M$ and 
represent the activities as follows: 

1. The Serial File 

$$\mathcal{A}_M : \{d \mid d \in Q(D)\} \to R(D)$$

2. The Inverted File 

$$\mathcal{A}_M : \{v(D) \cup R(d),\ \forall d \in Q(D)\} \to R(d)$$

3. The Multilist File 

$$\mathcal{A}_M : \{d \mid v(d) = \Lambda,\ \forall d \in Q(D)\} \to R(D)$$

$$v(D) \to R(d),\ \forall d \ni d \in \{Q(D) \cap \overline{\mathcal{S}}\}$$

$$v(D) \to v(d) \ni R(R(d)) \,/\, v(d) = \Lambda,\ \forall d \ni d \in \{Q(D) \cap \mathcal{S}\}$$

Note that in each case the file maintenance function begins with the set 
Q(D) produced by $\mathcal{A}_d$. 

The serial and inverted file maintenance functions are simple. In the 
serial file the descriptor set is assigned to a document record, while the 
inverted file requires the addition of a document identifier to the set of 
document identifiers referenced by each descriptor d. For the multilist file, 
the first operation refers to the formation of a main file record, the second 
describes the formation of a new directory record, and the third describes 
the setting of the main file link. 
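The three maintenance activities can be sketched with ordinary dictionaries. Everything concrete here is an assumption made for illustration: the function names, the use of integers as document identifiers, the representation of a main file record as a mapping from descriptor to the next document on that descriptor's chain, and the premise that each document is added exactly once.

```python
def maintain_serial(serial_file, doc_id, descriptors):
    """Serial file: assign the descriptor set to the document record."""
    serial_file[doc_id] = set(descriptors)


def maintain_inverted(inverted_file, doc_id, descriptors):
    """Inverted file: add the document identifier to the set of
    identifiers referenced by each descriptor."""
    for d in descriptors:
        inverted_file.setdefault(d, set()).add(doc_id)


def maintain_multilist(directory, main_file, doc_id, descriptors):
    """Multilist file (assumes doc_id has not been added before)."""
    # First operation: main file record with a null link per descriptor.
    main_file[doc_id] = {d: None for d in descriptors}
    for d in descriptors:
        if d not in directory:
            # Second operation: new directory record for a new descriptor.
            directory[d] = doc_id
        else:
            # Third operation: follow the chain to the record whose link
            # is null and set that link to the new document.
            current = directory[d]
            while main_file[current][d] is not None:
                current = main_file[current][d]
            main_file[current][d] = doc_id


def documents_for(directory, main_file, descriptor):
    """Walk a descriptor's chain and list the documents on it."""
    docs, current = [], directory.get(descriptor)
    while current is not None:
        docs.append(current)
        current = main_file[current][descriptor]
    return docs


serial, inverted, directory, main_file = {}, {}, {}, {}
for doc_id, q in [(1, {"d1", "d2"}), (2, {"d2"}), (3, {"d1", "d2"})]:
    maintain_serial(serial, doc_id, q)
    maintain_inverted(inverted, doc_id, q)
    maintain_multilist(directory, main_file, doc_id, q)
```

In this sketch the serial and inverted updates are single assignments, whereas the multilist update must traverse an existing chain before it can set the link, which reflects the relative complexity noted in the text.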

After determination of the descriptor set Q(D) and its subsequent use 
in file maintenance functions, the analysis block operates on the original text 
input to construct and/or maintain the document file. This function involves 
only the construction of the document record R(D) and the addition of the 
document to the set of all documents, 

$$(D' \to R(D')) \cup \{R(D),\ D \in \mathcal{D}\} \to \{R(D),\ D \in \mathcal{D}\}$$

In summary, the function of the analysis block ($\mathcal{A}$) can be represented as 

[expression illegible in the source] 

where the angular brackets enclose an ordered pair. 
OBSERVATIONS ON THE FUNCTIONAL REPRESENTATION 

Figure 2 provides a summary of the functional representation of an IR 
system. The verbal description of each block is replaced by the functional 
representation. The descriptor file and document file are indicated to be 
passive entities by the nature of their interaction with the selector, 
locator, and analysis blocks. In addition to representing the function of 
each block, the production or output of one block that serves as input to 
another is identified for each interaction between active blocks. 

Several observations on the functional representation seem appropriate. 
First, our purpose is to describe not only what is done in an IR system but 
also, to some degree, how it is done. The functional representation serves 
this purpose, and in so doing defines a metalanguage for IR languages. Second, 
we have striven for descriptiveness at the possible sacrifice of generality. 

Our contention is that an immediate result of the functional representation 
is a metalanguage that provides useful information on the capability of higher 
level programming languages used to implement either the entire IR system or 
any subsystem. In the earlier work by Crouch [2], an algorithm, based on the 
metalanguage, is described that provides a quantitative evaluation of the data 
structure and operator capabilities of several programming languages. 

Finally, the functional representation furnishes a direction that offers 
some promise in the identification of common ideas, practices, and methods 
and the eventual integration of these into a coherent body of concepts. A 
theory evolving in this direction should prove sufficiently descriptive to 
"theory users." 
Figure 2. The Metalanguage Description of the Generalized Information Storage and Retrieval System 
APPENDIX 



Notation                 Description 

::=                      definition sign 
[illegible]              equivalence 
{a | condition}          set of all a for which condition holds 
< >                      an ordered pair 
{ }                      any set 
[ ]                      any string 
$\phi$                   null set 
$\in$                    "is an element of" 
$\to$                    replacement (or assignment) 
$\odot$                  any binary relation (=, >, ...) 
[illegible]              any Boolean relation ($\wedge$, $\vee$) 
v                        the set evaluation function 
:                        function operator 
$\forall D$              {D_i | i = 1, 2, ..., n} where D_i $\in$ D ("for all D") 
$\ni$                    "such that" 
$\wedge$, $\vee$         conjunction, disjunction 
$\Lambda$                null field 
$\cup$                   union of sets 
|                        alternative (a|b ::= a or b) 
" ... "                  literal string delimiters 
$\cap$                   intersection of sets 
$\overline{A}$           the complement of set A 
[illegible]              blank character 

Table A1. Notation Used in the Functional Representation 



$f(x) \,/\, y \odot z$ ::= [flowchart, two variants in the source: x $\to$ r; f(r) $\to$ r; test y $\odot$ z; on false, repeat; on true, return r] 

$\{f(x) \,/\, y \odot z\}_t$ ::= t if x $\in$ f(z); $\phi$ otherwise 

Note: t is the value returned (t is a subscript). 

Table A2. Definition of Special Functions 